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Abstract —A Bayesian framework is proposed to define flexible 
coupling models for joint tensor decompositions of multiple data 
sets. Under this framework, a natural formulation of the data 
fusion problem is to cast it in terms of a joint maximum 
a posteriori (MAP) estimator. Data driven scenarios of joint 
posterior distributions are provided, including general Gaussian 
priors and non Gaussian coupling priors. We present and discuss 
implementation issues of algorithms used to obtain the joint MAP 
estimator. We also show how this framework can be adapted 
to tackle the problem of joint decompositions of large datasets. 
In the case of a conditional Gaussian coupling with a linear 
transformation, we give theoretical bounds on the data fusion 
performance using the Bayesian Cramer-Rao bound. Simulations 
are reported for hybrid coupling models ranging from simple 
additive Gaussian models, to Gamma-type models with positive 
variables and to the coupling of data sets which are inherently 
of different size due to different resolution of the measurement 
devices. 

Index Terms —Tensor decompositions. Coupled decomposi¬ 
tions, Data fusion. Multimodal data, Cramer Rao Bound, Big 
Data. 

1. Introduction 

In some domains such as neurosciences, metabolomics and 
data sciences, various data gathering devices are used to 
collect information on some underlying phenomena. Since it 
may occur that no data set contains a complete view of the 
phenomena, data fusion may be used to blend the different 
views provided by the complementary data sets, thus allowing 
a more comprehensive view. It is not then surprising that 
multimodal data fusion, i.e. fusion of heterogeneous data, has 
become an important topic of research in these domains [2]- 

[4]. 

One way of defining a framework for multimodal data 
fusion is to cast it as a problem of latent variable analysis. Vari¬ 
ations of the hidden variables are assumed to explain most of 
the variations in the measured data sets. Because the data sets 
are considered to be different views of the same phenomena, 
we suppose that the latent models are coupled through subsets 
of their variables. Here, coupling means statistical dependence 
between these subsets of variables. By exploiting this coupling 
in the joint estimation of the latent models, we expect that 
the information from one data set will improve the estimation 
accuracy and interpretability of the latent variables related to 
the other data set. 

An early example of the framework explained above dates 
back to Hotelling’s canonical correlation analysis (CCA) [5]. 
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In CCA a pair of subspaces is obtained such that the pro¬ 
jections of the data vectors on their corresponding subspaces 
are maximally correlated. It has been shown [6] that such an 
approach is equivalent to estimating two linear transformations 
of a shared latent white Gaussian vector from noisy measure¬ 
ments. While the linear transformations link the information 
in the data vectors within each data set, the underlying latent 
Gaussian vector allows the coupling between data sets. 

In a deterministic setting, coupled (or linked) factorization 
models for multiple multiway (W-dimensional) datasets were 
first described in [7] and they were repopularized recently in 
data sciences in [8]. The problem of joint nonnegative matrix 
factorization was considered in [8] under the constraint that 
one of the factors is shared by all matrices. In both cases, 
the coupling occurs through equality constraints on latent 
factors. Following these works, the framework of coupled 
tensor factorization was revisited in [9], [10]. Variations on this 
framework, such as tensor-matrix factorizations [11], [12] and 
more general latent models [13] have also been considered. 
Uniqueness properties and the development of algorithms for 
the exact coupled tensor decomposition problem are pro¬ 
posed in [14] and [15], while algorithms for the coupled 
tensor approximation problem under general cost functions 
are developed in [4], [16], [17]. More flexible models for the 
couplings have been proposed in [3]. Instead of considering 
equality constraints for the entire factors of a tensor model, 
only a few components or elements are constrained. In [18] 
the problem of coupled nonnegative matrix factorization is 
considered and a flexible coupling is proposed by assuming 
that the shared components are similar in or £2 sense. 
Recently, factors coupled through their numerical derivatives 
have been considered [19], which is a generalization of [11], 
but still a particular case of this work. 

In this article, we propose a generalization of the above 
flexible models for joint decompositions using a Bayesian esti¬ 
mation approach. This allows modeling approximate couplings 
and the derivation of performance bounds. 

Outline and contributions 

• After introducing our Bayesian framework, we present 
two general examples of coupling priors such as joint 
Gaussian priors and non Gaussian conditional distribu¬ 
tions. We describe two algorithms for joint tensor approx¬ 
imate decomposition, a modified least squares algorithm 
for the joint Gaussian model and a multiplicative update 
for a positive coupling model. 

In contrast to [17], we focus on stochastic models for 
the coupling between the factors in the decompositions 
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and not on separate stochastic models for each factor. 
As the model considered in [18], our model is more 
flexible than the coupled model proposed in [3], since the 
coupled variables do not need to be equal. Moreover, the 
joint Gaussian model and hybrid model that we propose 
are more general than similarity between factors, since 
our model allows different linear transformations on the 
coupled factors. This is useful if we want to fuse data sets 
which are measured in different domains, for instance one 
in time and the other in frequency, or which are sampled 
differently. 

• We discuss implementation issues related to this flexible 
coupling approach. In the case of a simple Gaussian 
coupling model, we propose a joint compression approach 
to deal with large data sets. To our knowledge, our work 
seems to be the first to discuss and propose a solution 
for the approximation of large coupled decompositions. 

• Since we define the coupled approximation problem with 
a Bayesian framework, we are able to obtain bounds on 
approximation variance in terms of mean squared error 
(MSE) using Cramer-Rao bounds (CRB). In the case 
where all parameters are random and both measurements 
and factors are Gaussian, we show how to obtain the 
Bayesian Cramer-Rao bound (BCRB) for a coupled ten¬ 
sor model, while in the case where only the coupling is 
random, we show how to obtain the Hybrid Cramer-Rao 
bound (HCRB). 

Although CRB for tensor models have been developed in 
[20] and [21], this article is the first to present bounds on 
the coupled tensor model. 

• We simulate the proposed algorithms to retrieve the 
factors of two tensor models coupled through one of the 
factors. The simulations are carried out for several types 
of coupling, including Gamma-type coupling of positive 
variables, and coupling of data sets which are inherently 
of different size due to different time resolution of the 
measurement devices. 

Gamma-type measurements can be dealt with the frame¬ 
work presented in [17], but not Gamma couplings as 
presented here. Similarly, the coupling of data sets with 
different sizes cannot be handled with the flexible models 
presented in [3] or [18]. Note also that this type of model 
appears naturally in multimodal data fusion since nothing 
guarantees that measurement devices of different nature 
will generate data sets with the same resolution. Our 
approach in this case can be seen as a generalization 
of the pan-sharpening approaches considered in remote 
sensing [22]. 

Notation 

To simplify the presentation, all quantities will be assumed 
to belong to the real field. Scalars (resp. vectors) are denoted 
by lower case, e.g. x, (resp. bold lower case e.g. x) letters. 
Matrices are denoted by upper case bold letters, e.g. X, while 
tensors by calligraphic letters, e.g. X. Elements of a given 
array are indicated by subscripts and are non-bold, e.g. Xijk. 
Vectorization of matrices is indicated by vec(-), and stores 


bijectively all entries of an array into a column vector column¬ 
wise [23]. The Kronecker product of two matrices X and Y 
is denoted by X 0 Y, while the Khatri-Rao product (column¬ 
wise Kronecker product) hy X Q Y [23]. The Hadamard 
(element-wise) product, division and (x-th) power of matrices 
are denoted by El, 0 and The left inverse of a full 
column-rank matrix M is denoted by and defined as 
I'M = The right inverse of a full row- 

rank matrix is M^ = M^(MM^)“^. Both are actually 
computed via the economy size singular value decomposition 
(SVD) in practice. We denote the trace of a matrix by tr(-). 
A vertical stack of matrices or vectors will be denoted with 
semi columns as [x\y\. A multilinear product of matrices 
A, B and C with same number of columns is denoted by 
(A,B,C); in other words, it is the 3-way array defined 
by {A,B,C)ijk = Y.r^irBjrCkr- Y(€j denotes the z-th 
matrix unfolding^ of tensor y, which is obtained by stacking 
all matrix slices along mode z; see a detailed definition in 
e.g. [24] or [25]. In particular, if = {A,B,C), then 
Y ( 3 ) = C{B © A)~^, and if y is of size I x J x K, then 
Y ( 3 ) is K X I.J. 

II. Coupled decompositions: a Bayesian approach 

Consider two arrays of noisy measurements, y and y', 
which can be tensors of possibly different orders and dimen¬ 
sions. Arrays y and y' are related to two parametric models 
characterized by parameter vectors 0 and 6', respectively. 

Eor instance, if is a matrix (a second order tensor) to be 
diagonalized, the model can be the SVD -I- E, 

so that 6 = [vec {[U; V]); diag{S}] and E is a noise matrix. 
If y' is a third order tensor, its Canonical Polyadic (CP) 
decomposition [26], [24] is written as y' = {^A', B', C') -\- 
5', meaning that A'„B'jrC'kr + = 

vec {\A'] B'; C'\), and again £' is a noise array. 

In the case where 6 and 6' are not coupled, they can be 
obtained separately from each array, generally non uniquely 
in the matrix case. On the other hand, if they are coupled, 
then the arrays can be processed jointly and parameters are 
then uniquely estimated, under mild additional assumptions 
[14]. Here we consider that the coupling between 6 and O' is 
flexible: for example, we could have V k, B, or V ~ WB 
for a known transformation matrix W. To formalize this we 
assume that the pair 9, O' is random and that we have at our 
disposal a joint probability density function (PDE) p{0,0'). 

Maximum a posteriori estimator: since the pair 0, O' is 
random, we adopt a maximum a posteriori (MAP) setting^. 
In this setting, the approximation is defined as 

&rgm&yip{0,0'\y,y') = argrrraxp{0,0' ,y,y') 
e,e' e,e' 

= argmin T(0, O') 
e,e' 

where T{6,0') = — \ogp{0,0',y,y'). Conditioning on the 
parameters leads to a cost function that can be decomposed 

’Also referred to as “mode-z matricization.” 

^We could also consider a minimum mean squared error setting but then 
we would need to evaluate p{'y,'y'), which is typically cumbersome. 
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into a joint data likelihood term plus a term involving the 
coupling: 

T( 0 , e') = - \ogp{y, y'\e, e') - \ogp{e, e'). (i) 

Under the following simplifying hypotheses underlying the 
Bayesian approach, the MAP estimator is given as the mini- 
mizer of a three-term cost function : 

argmin T(0, O') = 

0M’ 

argmin [-\ogp{y\e) - logp{y'\6') - logp{6,e')] . (2) 

9 , 0 ' 

HI Conditional independence of the data: the data arrays 
y and y' are independent of O' and 0 respectively, 
if they are conditioned on 0 and O'. We suppose 
also that they are conditionally independent on each 
other. This results in the following p{y\y' , 0, O') — 

piy\o) and piy'\y, o, o') = p{y'\o'). 

H2 Independence of uncoupled parameters: all parame¬ 
ters except the coupled parameters are independent. 
In the joint SVD and CP cases, this means that 
p{0, O') = p{V, B)p{U)p{'E)p{A)p{C) if coupling 
exists only in the second mode. 

H3 On the priors: trivially, the joint distribution of the 
coupled parameters, e.g. p{V, B) needs to be known 
or, at least, one of the conditional distributions, 
namely p('UIJB) or p{B\V). The marginal priors on 
the uncoupled and on the conditioning parameters 
(p{U), ..., p{C)) are assumed either to be known 
or to be flat on some domain of dehnition. 

H4 Likelihoods: the conditional probabilities (or likeli¬ 
hoods) p{y\0) and p{y'\0') are known, at least up 
to a scale parameter. In a MAP setting, this indirectly 
sets the weights which will be given to each data 
array in T {0,0'). 

Hypotheses H3 and H4 are assumed so that all terms 
in this cost function are dehned. H2 is assumed so that 
the probabilistic dependence of the parameters dehnes the 
coupling between various latent models. 

The framework presented above contains as specihc cases 
the following: 

• Hybrid estimation: if we consider that the conditional 
distribution p{V\B) is known and that B is determin¬ 
istic and dehned in a domain LIb, then we have to 
minimize Th{0,0') = — logp(3^|0) — \ogp{y'\0') — 
\ogp{V,B') subject to B G LIb, which is a hybrid 
(random/deterministic) coupled approximation problem. 
This is the framework implicitly considered in [18]. 
Note that to enforce symmetry of roles of the coupled 
parameters, for example if V and B are noisy versions 
of each other, we can consider that they are similar 
versions of the same deterministic latent factor V*. This 
symmetric approach has the advantage that it can be 
trivially extended to more than two coupled models. 

• Coupled approximation: the usual assumption that the 
coupled factors are equal, V = B, can be obtained by 
setting a Dirac’s delta prior p{0,0') = S{B — V). The 


MAP problem becomes a MLE problem with equality 
constraints 

maximize logpjiV; 0) + \ogp{y'; O') 

with respect to (w.r.t.) 0, O' 

subject to V = B 

(3) 

which corresponds to the coupled approximation frame¬ 
work with general cost functions. Different versions of 
this approach are proposed in [4], [8], [16]. Under the 
assumption of Gaussian likelihoods, we have the standard 
coupled approximation framework [12]. 

• Exactly coupled decompositions: additionally to exact 
coupling of one of the factors, if the data sets are not 
corrupted by noise, the likelihoods are also products of 
delta functions and we have to solve an exactly coupled 
decomposition problem [14], [15]. 

In the next section we give many examples of possible joint 
decomposition problems and their MAP or hybrid MAP/MLE 
objective functions. 

HI. Elexible coupling models 

In what follows we consider that the parametric models 
underlying the data arrays are two CP models X = {A, B, C) 
and X' = (a', B', C') with dimensions /, J, K and J', 
K' and number of components (i.e. number of matrix columns) 
R and R' respectively. We consider that the coupling occurs 
between matrices C and C . We illustrate this framework 
with three different examples: general joint Gaussian, hybrid 
Gaussian and non Gaussian models for the parameters. 

A. Joint Gaussian modeling 

A general joint Gaussian model comprising coupled and 
uncoupled variables is given by the following expression: 

M [0- O'] ='Bu + p (4) 

where M is a matrix dehning the structural relations between 
variables, tr is a white Gaussian vector with zero mean and 
unit variances, S is a diagonal matrix of standard deviations 
and p is a constant vector. A condition for the pair {0, O') to 
dehne a joint Gaussian vector is the left invertibility of M. 
Under this condition we have [O; O'] ~ J\f{"^Mp, R}, where 
R = (tM)SS(tM'^) is the covariance matrix of the joint 
vector. 

Considering that the CP models X and X' are measured 
with zero mean independent additive Gaussian noise, with 
variances cr^ and a'f, respectively, we have the following MAP 
objective function: 

T {0,0') = 

^\\y-{^A,B,C)\\l + ^\\y'-{A', B',C')\\l 
+ |l[0;0']-tM/r|l^ 

(5) 

where || • ||f II • ||r ^^e respectively the Erobenius norm and 
i?-weighted Erobenius norm. We give below two examples of 
applications of this approach. 
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Shared components: a usual problem in multimodal 
data fusion is that some components are not present in 
all modalities, thus we have some components which are 
shared and some which are specific to each modality. Suppose 
® = Wu'i Cl] and 9' = [6'^; d^], where Ci and c'l are coupled, 
whereas 9u and Oj are not. 

Supposing zero mean marginal Gaussian priors, we may 
write 

9 '^ = ( 6 ) 

Cl - c\ = ScMl, c'l = (7) 

where u, u', Ui and u'^ are zero-mean unit variance indepen¬ 
dent Gaussian variables, and where Sj. and S], define 

corresponding covariance matrices. With an obvious definition 
of M and S, equations (6-7) can be grouped in a unique block 
equation of the form: 


- 6>„ - 


u 

Cl 

0;. 

= E 

Ui 

/ 


u 

L c'l J 


_ '^1 _ 


Dynamical models: in some cases, more than two data ar¬ 
rays are present and, as a consequence, there is a large number 
of degrees of freedom in the definition of the couplings. If the 
data arrays are measured in time, then a natural coupling can 
be defined through a dynamical model on the factors [27]. 
Finding the MAP estimator in this case amounts to solve a 
smoothing problem, since the temporal relations embedded in 
the dynamical model will impose some smoothness on the 
estimated factors. 

B. Hybrid Gaussian coupling 

When no prior information is given on some parameters, full 
joint Gaussian modeling is inadequate, since we cannot define 
means and variances of the model. Instead, we consider these 
parameters as deterministic, while the other parameters are still 
stochastic with Gaussian priors. We call this model hybrid 
Gaussian. This is the main focus of this paper since most 
applications will fall in this category of coupling. Indeed, it 
covers the scenario where only one factor, say C, is coupled to 
another, say C', with transformation matrices and independent 
and identically distributed (i.i.d.) Gaussian additive noise: 

HC = H'C + r, ~ AA( 0 , al) ( 8 ) 

for some transformation matrices H and H'. If relation ( 8 ) is 
the only known relationship between parameters, it means that 
only the conditional probability p{C\C') is known, and that 
C' is either deterministic or has a flat non-informative prior. 

This coupling model cannot be written with zero noise 
variance if the transformation matrices are tall (i.e. more rows 
than columns), as the set of feasible parameters would be 
reduced to the trivial set. This is important because using zero 
noise variance as a working hypothesis as in [3] leads to a 
bias in the hybrid Gaussian coupling estimation setting when 
transformation matrices are tall. 

Even more critical is the norm of the coupled factors. 
Without proper normalization, the CP decomposition is in¬ 
variant w.r.t. the norm of columns of C and C' as it can 


be incorporated in the other factors. This means that the 
variance of F is not well defined when defining the joint tensor 
decomposition problem with ( 8 ). It is necessary to add the 
information on the norm of columns of the coupled factors 
in the hybrid model ( 8 ), or to consider the coupling variance 
as an unknown parameter. This will also impact algorithms 
designed in Section IV. 

Sampling a continuous function: as an example, an 
important problem in multimodal data fusion is related to sam¬ 
pling. Different measurement devices have different sampling 
frequencies or even different non uniform sampling grids. In 
some situations, the continuous functions being measured can 
be approximated by a common function c{t). For two sampled 
vectors c and c', their relation with the continuous function 
can be obtained with an interpolation kernel (see the general 
description in [28]) 

K K' 

cit) « X! tk) « ty ) (9) 

k=l k’ = l 

for some kernels and and for sampling times 

G Ij ••• ) k\ and {tk>,k' € 1, ■ ■ ■ , K'}. Therefore, 
we can impose a new common sampling grid of size L where 
both interpolations should match. This leads to the linear re¬ 
lations He « H'c', where Hik = h{ti,tk), = h{ti,tk') 
and {ti,l G 1, • • • , L}- For example, in the coupled CP 
models, when C and C' have different dimensions due to 
different sampling rates, their coupling is governed by: 

He = H'e' + 7 . (10) 

C. Non Gaussian conditional coupling 

Non trivial couplings expressing similarity between the 
factors C and C' can be addressed by considering that C' is 
deterministic and that the coupling is given by a conditional 
non Gaussian distribution, p (CjC^). 

Impulsive additive coupling: as a first example, we can 
assume that each element in C is a version of C' corrupted 
by i.i.d. impulsive noise: 

( 11 ) 

where Tij follows a Faplacian distribution p{Tij) = 
(l/25c) exp(—iFyl/Jc) with scale parameter dc or a Cauchy 
distribution p{Tij) = 1 /{'k5c[I + {Wij/Scf']}. The objective 
functions to be minimized are then 

T {9,9') = - logp{y\9) - \ogp{y'\9')+ 

+(1/24)||C-C'||i (Faplace) 

- E log{l + [{Cij - Ck)/5cY} (Cauchy) 

( 12 ) 

where |j • ||i is the norm. The first penalty was considered 
in [18] in a collective matrix factorization context. Both cost 
functions imply a small number of large discrepancies between 
C and C'. 
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Positive general coupling: when Cij > 0 and > 0, 
an additive noise in the coupling may not be the best option, 
since, to ensure Cij are positive, the support of the additive 
term has to depend on the values of which is not a 
realistic assumption on a perturbation term. Therefore, other 
alternatives naturally ensuring positiveness can be considered. 
For example, the Tweedie’s distribution [29] which contains 
Poisson, Gamma and inverse-Gaussian distributions as special 
cases (the Gaussian distribution is a limiting special case). 
In general, the PDF of the Tweedie’s distribution has no 
analytical form, thus we cannot directly use it to write down 
a coupling term in the MAP objective function. However, 
if we consider that the coupling between Cij > 0 and 
C'ij > 0 is strong (dispersion (j)^ is small), then a saddle point 
approximation can be used [29]: 


where Pc is a shape parameter {Pc = 1, 2, 3 for the Poisson, 
Gamma and inverse-Gaussian distributions respectively) and 
is the beta divergence [30]: 


dpAC^^CC) 




k(/?c) 


^2—/3c 


—1 

^ij 


C,,{2-pc)+C'^{l-Pc)\ 

(14) 


where k{P) = (1 —/3)(2 —/3). Under this conditional distribu¬ 
tion, assuming that the CP model is positive (A, B, A' and 
B' are also positive) and that the data distributions are also 
of Tweedie type with shapes /? and /?' and dispersions p and 
p', we have the following hybrid MAP/MLE objective: 


T(0,6»') = 


UK 

dl3{yijk\^ijk) 

1 

i'j'k' 

^13' {Vijk 1 k ) 

ijk 

KR 


ijk 


+ ;^ [(/?/2)iog(a,) + {1/Pc)dp{c.pc[j)]. 

ij 


IV. Algorithms for coupled tensor 

DECOMPOSITIONS WITH FLEXIBLE COUPLINGS 

To fix the ideas, our discussion will be limited to coupled 
CP models where the coupling occurs in the third mode, 
that is, between the third factor matrices. An algorithm for 
the joint Gaussian coupling revolving around vectorization of 
factors matrices - the coupling itself being applied to vector 
6 - has been derived, but it is bound to be computationally 
heavy since it requires the solution of large linear systems 
of equations at each iteration. Thus, it will not be presented 
in this work. An algorithm for the C coupling model has 
already been presented in [18], and its counterpart for the 
Cauchy distribution can be obtained with a gradient approach, 
since the coupling prior distribution is smooth. We will refrain 
from detailing these algorithms. In this section we focus 
on algorithms to minimize the objective functions for the 
remaining presented models, namely, hybrid Gaussian models 
with i.i.d. Gaussian likelihoods (8) and the Tweedie’s model 

(15). 


A. Hybrid Gaussian modeling and alternating least squares 
(ALS) 

Under hybrid Gaussian coupling assumption, the MAP 
estimator is the minimizer of the following cost function^: 

T{6,6') = \\\y - {A,B,C)\\% + 

1 T (16) 

— Iiy - {A\B\CprF + -^\\HC - H'CTf- 

G ^ G^ 

^ n ^ c 


To minimize (16), it is possible to resort to standard algorithms 
matched to convex optimization, which generally requires to 
tune additional parameters like stepsize. Here we chose to 
resort to a modified version of the widely used ALS [26], 
which is simple to implement. 


Joint update of coupled factors: to build the ALS algo¬ 
rithm, the gradient of T w.r.t. each factor matrix needs to be 
computed, and set to zero. Lor uncoupled factors, this leads to 
the standard uncoupled ALS update rules, see Algorithm 1. 

Now for the coupled factors C and C', a simultaneous 
update ensuring symmetry in the treatment of both data sets 
requires to solve the following system: 


tCD - 


H^H'C 


4h'~^H'C' +"'AiC'D' - 




(17) 

where JD = (B © 0 A), and JD' = 

{B'QA'y{B' ©A'). This system is costly to solve 
for {C,C'}. It contains indeed two coupled Sylvester 
equations. A straightforward approach teaches us to vectorize 
parameter matrices C and C' and to solve the following 
stacked large linear system of equations w.r.t. {C,C'}: 


G vec([C; C'j) = vec 


L^n 


where G — [Gii, G 12 ; G 217 G 22]3 with 

Gii= ^Jr(^H'^H+^D'^0Ik 
G 22 = ^Jr ® H'^H' + ® Ik 

Gi2 = © H^H' 

G2i= -^Jr®H'^H. 


(18) 


(19) 


Remark. Note that we present here an alternating algorithm 
for one example of a hybrid Gaussian coupling. Another 
possible type of coupling is when the transformation matrices 
act on the right of the factors, i.e. when the transformation is 
in the component space. Then the previous algorithm should 
be modified because full vectorization is not needed, as the 
Sylvester structure of the two coupled equations becomes a 
simple stacked linear system. 

Full alternating solution: a full alternating solution can 
be obtained by solving sequentially the two equations in (17). 
At the k-th iterate we obtain the estimate G^ by solving 


\h^hg + \gd 


Xh^h'c[ 


-1 + 


V(3)B 


( 20 ) 


^For numerical reasons, if ctc is much smaller than an or cr^, it will be 
preferred in practice to set HC = H'C'. 
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w.r.t. C, where Cf._i is the previous estimate, considered as a 
fixed constant matrix for C . The estimate is then obtained 
by solving 




,C'D' = ^H'^H'Ck + 


4y(3)'£»' 


( 21 ) 


w.r.t. C'. This approach leads to a much faster computation of 
each iteration compared to the simultaneous update previously 
described, since any algorithm efficiently solving the Sylvester 
equation can be used [31], [32]. 


Scale ambiguity and normalization: since CP decomposi¬ 
tion does not directly impose scaling on the estimated factors, 
the coupling term in the MAP objective function can be 
decreased by rescaling the coupled factors. Therefore, without 
proper normalization on the factors updates, the norms of the 
columns of the coupled factors C and C' may tend to zero. In 
addition, if the norms of the columns of C and C' tend to zero, 
the coupling term will become negligible when compared to 
the likelihood terms. As a consequence, the updates will tend 
to the decoupled solution. Moreover, will not represent the 
actual variance in the coupling model, and a scaling matrix 
should be introduced in the covariance matrix of T. 

A first solution to overcome this problem would be to nor¬ 
malize the columns of C directly in the designed algorithms, 
for example using the or ^2 norms. But this means adding a 
norm constraint on the coupled factor which is already heavily 
constrained by the coupling relationship. We thus made the 
choice to normalize A and B to indirectly impose some 
normalization of C, and only normalize A' since C' will be 
determined partially by the coupling model. 


Initialization and stopping: the MAP objective function 
for the coupled CP model is non convex, as a consequence, 
multiple local maxima may be present. To obtain an estimate 
close to the MAP estimator (the global maximum) a standard 
approach is to carry out ALS multiple times with different 
initializations, and then pick the solution giving the best MAP 
objective value [33, pp.62-63, p.l22]. There are two options 
for initializing the coupled algorithms, namely starting directly 
from random guesses (cold start), or refining those guesses in 
two uncoupled ALS (warm start). We found that these two 
approaches are not equivalent, and starting from the random 
guesses may lead to larger bias and slower decrease of the 
cost function when the coupling is weak (see Appendix A). 
However, before initializing the coupled ALS procedure with 
the uncoupled results, it is necessary to correct the permutation 
ambiguity which are intrinsic to the CP model. To achieve this, 
one of the estimated CP models is fixed and the components 
of the other CP model are permuted such that the value of the 
coupling term \\HC — H'C'Wj^ is minimized. This can be 
done by solving an assignment problem, which in practice is 
carried out in polynomial time with the Hungarian algorithm 
[34]. 


At each run, the algorithm stops at iterate k if either the 


relative decrease ATt = 


Tfc—Tfc_i 


achieves ATv, 


maximum number of iterates femax has been reached. 


Algorithm 1. ALS algorithm for the hybrid Gaussian coupled model 


Require: y, y\ 




O' 


1 : Apply two uncoupled ALS with random initializations to 
obtain Aq, Bq, Cq, Aq, ~ ' ~ 


Bq and Cq. 


2: Permute initialization factors columns to have minimal 


2 

F- 


^\HCo-H'Cq\, 

Evaluate the initial cost function Tq (16). 
Set k = 0, AT = -foo. 
while k < fcmax and ATfc > AT^in, do 
k;=kH-L 
Update Ak'. 

- ntV 

(1) ( C'/c-i © Bk-i 


Ak = Y 


Normalize columns of Ah. 

8 : Similarly update and Bf.. 

Normalize A^ and Bk, Bj^ remaining unnormalized. 
9: Solve (18) to obtain the joint update [Ck'jCf.] or (20) 

and (21) sequentially. 

10 : Compute cost function (16) and its relative AT^. 

11 : end while 

^ ^ ^ .. I .. I .. I 

12 : return Afc, Bk, Ck, A^,, B). and C^,. 


B. Coupled decompositions of large data sets 

Let us consider again the hybrid Gaussian problem (8) with 
the simple coupling 

C = C' + r (22) 

where F is an i.i.d. Gaussian matrix with variance of each 
element and C' has columns of given norm. A natural 
question that raises from previous discussions is scalability of 
coupled decompositions. If we disregard the coupling of the 
tensors, a common strategy to retrieve the CP models in a more 
computationally efficient way is to compress the data arrays, 
decompose the compressed tensors, and then uncompress the 
obtained factors [33, p. 92]. 

Uncoupled compression: to compress data arrays, the 
higher order singular value decomposition (HOSVD) [35] can 
be used. The HOSVD is given by three orthogonal factors U, 
V, W and a core tensor Q with dimensions Ri x R 2 x Rq. 
If we assume that data are noiseless, we have 

R1R2R3 

Aijk — ^ ^ UirCjs^YktQrst 

(23) 

R{R'^R'^ ^ ’ 

^ljk= E UiV'MtQr^t 

rst 

which we can denote as A = ([/, V, W)Q and X' = 
{U',V','W')Q'. The noiseless tensor models X and X' 
can be seen as array representations of multilinear operators, 
thus the HOSVD described above corresponds to changes of 
coordinates respectively from X and X' to Q and Q'. If 
the multilinear subspaces spanned by each mode’s vectorized 
slices of the original tensors are of low dimension, then the 
original tensors lie in the span of a small number of basis 
vectors in each mode, i.e. {I,J,K) > (i?i, i? 2 , and 
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> {R'l, R 2 , R';^). The core tensors Q and Q' are 
rotations of X and X' with many zeros after a certain index 
in each mode. 

This approach is valid with equalities in (23), if tensors X 
and X' are not corrupted by noise. If additive measurement 
noise is present but its variance is not too large, HOSVD 
orthogonal factors explaining most (but not all) of the variance 
in tensor data can be accurately obtained from the data arrays 
y and y' by retaining only the first columns of the left 
singular vectors of the unfolded data arrays 

r = SVD [1^(1)] 

J ySVD_^^ ^ ^ (24) 

[ TySVD_^3 ^ 

The estimated factor matrices are U = V = 

and W = The core is estimated as 

UK 

Qrst — E y^JkU„VjsWkt. (25) 

ijk 

Now ^ is a compressed version of the noisy tensor y. 

We can now decompose the core tensors to obtain two CP 
models (Ac, Be, Cc) and (A^, B'c, C^). The cores have much 
smaller dimensions than the original tensors, therefore obtain¬ 
ing the compressed CP models is much less computationally 
expensive. To retrieve the uncompressed CP models, we can 
use the fact that (U, V, W)g = (UAc, VBc, WCc), which 
means that the original factors can be recovered by inexpensive 
matrix multiplications 

A = UAc B^VBc C = WCc. (26) 


Note that the bottleneck with this approach is the computation 
of the three SVDs. This can be tackled efficiently using the 
Lanezos algorithms [36] or one of its variations^ recently 
proposed in [37]. 

Coupling in the compressed domain: if we apply the 
HOSVD to the noiseless tensor models X and X', we can 
express the coupling in the compressed space as 

WCc = W'C'c + r (27) 


that is, we suppose C lies in the column space of W and 
C' in the column space of W'. Therefore, we could assume 
the coupled approximation setting in the compressed space 
with the flexible coupling model (27) above. However, if one 
of the data arrays is noisy, for example [y, its corresponding 
estimated HOSVD factor W may have large estimation errors 
and the coupling relation will be no longer true since C will 
not lie in the span of W. On the other hand, if is small, 
factors C and C' are very similar, and as a consequence, the 
left singular vectors of the unfoldings 7V(3) and ^(3) are also 
similar. This indicates that we can obtain a joint orthonormal 
factor as left singular vectors of the stacked and weighted 
unfoldings 


W> N 3 = SVD 


1 ( 3 )^ 


Y 


(3) 



(28) 


“^We adopted Algorithm 5.1 of [37] with oversampling parameter of 2 and 
one power iterate. 


Now the uncompressed coupling relation (22) can be rewritten 
as = \yV^Y and is valid even in the 

presence of observation noise since C and C do not need 
to belong to the column space of the jointly estimated factor 
. Moreover, since compressed parameters are now defined 
in the noisy case as Cc '■= \W^YC and C'c '■= \W^YC', 
the similarity coupling appears directly in the compressed 
dimension 

Cc = C'c + Tc (29) 

where Tc = F is a matrix of same dimensions as Cc 

and with Gaussian i.i.d. entries of variance cr^. The hybrid 
objective function in the compressed domain becomes 


T = — ||g-(Ae,B,,C, 


= tllF + ^ 


\g' -{A'c, B'c, C'c 

+ X \\Cc-C 


F 
/ ||2 


(30) 

which can be solved using the ALS algorithm (Algorithm 1). 


C. Multiplicative updates for non Gaussian coupling of posi¬ 
tive variables 

To minimize the objective function (15) under the constraint 
that all estimates are nonnegative, we can use a multiplicative 
update (MU) algorithm [38], which is a popular method in 
nonnegative matrix factorization. The idea of MU is to alter¬ 
natively update the factors using a gradient descent algorithm 
where the stepsize is chosen as a function of the negative and 
positive parts of the gradients. As a consequence, the gradient 
descent update becomes a simple nonnegative multiplicative 
update [39]. For example, the update at iteration k of the 
factor A is 

Afc=Afc_iS(VAT-0VAT+) (31) 

where VaT+ and VaT“ are the positive and negative parts 
of the gradient VaT = VaT+ — VaT~ evaluated with the 
most recent updates of the factors. Note that this update is 
always positive if the initial conditions are positive. In practice, 
to ensure that divisions by zero do not occur the update is 
modified as follows 

Afe = Afc_i m [VaT^ 0 max (VaT^jEI)] (32) 

where max(-,-) is the element-wise maximum between the 
elements of the two matrix inputs, £ is a small constant and 1 
is a matrix with ones in all elements. After carrying out both 
updates of the form (32) for A and A', we update the other 
factors in a similar way as in ALS. To solve the scale issue, 
we also need to normalize A^, Bk and A^, after each update. 
Since the factors are positive their columns can be normalized 
using the ii norm, i.e. |la||i = instead of the £2 norm. 

Also, similarly to the ALS procedure, to start closer to the 
solution, the coupled MU can be initialized using the results 
given by two uncoupled MU. In this case, the right permutation 
may be found by minimizing the fdc divergence between the 
permuted factor Cq and the fixed factor C^. 

The MU algorithm is described in Algorithm 2, the positive 
and negative terms in the gradients are given in Appendix B. 
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Algorithm 2. MU for positive CP approximation with positive couplings 

Require: y, y\ (j), </>', (l)c, /?, /3', /3c, fcmax, ^min 

1: Apply two uncoupled MU with random initializations to 
obtain Aq, Bq, Cq, Aq, Bq and Cq. 

2: Permute initialization factors columns to have minimal 

^ij '^/3c([Co]y , Polii)- 

3: Evaluate the initial cost function Tq (15). 

4: Set k = 0, AT = +00. 

5: while k < fcmax and ATfc > ATmin, do 
6: k;=k+l. ^ 

7: Update Ak 

Ak = Ak-i m [VaT/: 0 max (VaT^, £l)] 
Normalize columns of 
8: Similarly update Bj., Ck and 

Normalize Af, and Bk, Bf. remaining unnormalized. 

9: Evaluate the cost function Tfc (15) and its relative 

decrease AT^. 

10: end while 

11: return Afc, Bk, Ck, A^, B^. and Ck- 


V. Bayesian and hybrid Cramer-Rao bounds 
A. Introduction 

A lower bound on the MSE can be evaluated using vari¬ 
ations of the Cramer-Rao bound (CRB). In what follows we 
will consider two simplifications of the joint Gaussian model, 
namely the simplified coupling model and unitary first rows 
for uncoupled factors. 

Simplified coupling model and unitary first rows: in both 
models we consider Gaussian couplings between components 
Cr and c(, of factors C and C' of the form 

Cr = Hc'^ -I- 7^, for r G {1, , • • • , i?} (33) 

where iT is a transformation matrix and 7^ is a zero mean 
white Gaussian vector with variance Similarly to [20], 
[21], to resolve the scale ambiguity of the CP models, we 
assume that the components in factors A, B, A' and B' are 
of the form a = [1; a], that is, the elements in their first row 
are known and equal to unity. Without proper constraints on 
the factors, the information matrices that will be defined later 
are not invertible and, as a consequence, the bound cannot be 
evaluated. 

Priors on the factors: we consider a Bayesian model and 
a hybrid model. They differ in the definition of the priors of 
all factors except C'. In the Bayesian model the unknown part 
of the components of A, B, A' and B' have Gaussian priors 
of the form 

CLr = dr + Ja-r (34) 

where dr is a constant vector representing the mean of the 
prior and 7^ is a zero mean white Gaussian vector with 
variance a\. The components of the factor C' have a similar 
prior, except that the elements in the first row are also 
unknown. 

In the hybrid model, we consider that only C is random, 
all the other factors are deterministic. 


Likelihoods: the noise follows the model in (5), so 
that the log-likelihoods can be written as a function of the 
components in the factors as follows 


C = \ogp{y-A,B,C) = 


^ / fl R 

-7 lll/abc - ^ dr ® hr ® Crf + \\y -'^dr'S ic, 

\ r=l r=l 

R R 

+ ll?/bc - dr ( 8 ) Crf + WVr “ Cr 


r=l r=l / 

(35) 

where y indicates the vectorized version of the data array^ 
and the subscripts indicate the free variables underlying the 
elements in the measurement vector. Eor example, in we 
have all data array elements which do not contain index 1 in 
the first mode and contain only index 1 in the second mode. 


B. Bayesian Cramer-Rao bound 

When all factors are random, we can obtain a bound 
on the estimation MSE through the Bayesian Cramer-Rao 
bound (BCRB) [40, pp.4-5]. Eor a vector of parameters 


= vec 


A-B,C,A -B ,C' 

MSE{9f > BCRB 


we have 


(36) 


where the BCRB is a matrix given by the inverse of a data 
related information matrix F plus a prior related information 
matrix P: 


BCRB = (F + P) \ (37) 


In what follows, we describe both matrices F and P. F is 
the average Eisher information matrix given by 


F = 


E[F] 0 
0 E[F'] 


(38) 


where the expectation is evaluated w.r.t. the prior distribu¬ 
tions. The off-diagonal matrices are equal to zero due to the 
conditional independence assumption (HI). F is the Eisher 
information matrix for the first CP model; 



Va^ 

VcC 





(39) 


VxC corresponds to the gradient of the log-likelihoods w.r.t. to 
a vector x. The expectation is evaluated w.r.t. the conditional 
distribution of the data array given the parameters. The matrix 
F' for the other CP model is given in a similar way. The 
gradients w.r.t. components dr, br and c^, which are parts of 
the gradients above, are 


Va„r = - 2 - 

'-'n 

{I SbrS Crfe^l,^ {I S Crf Car 

n 

{drSiS Crfcg^^^ + {IS Crfe^^ 

Vc„r = - U 
<ri 

{dr SbrS if ■+ {Or S if Car 


-y {br S i f 


^The order of the modes for vectorization is third, second and first mode 
(order of running indexes). This implies that Tijfc = 
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Sub matrices 

Fxal 

E{F}xal 

(^r5 ^s) 

(b' b, + l){cjc,)l 

[bibs + 1 + ( J - l)alSrs] { cjH^Hcs + [Ka^ + ti{H^H)a^,] [ / 

(^rJ 6s) 

{cjcs){as SK +1) 

l^cjH'^Hcs + [Ka^, + iiiH^H)al.,]j (a, Sbl + I) 

Cg) 

{bibs + l)(d^ 0 cj) 

[bibs + 1 + (J - l)al6rs]{ds 0 HdJ ) 

(6r, 6s) 

{djas + l){cjcs)l 

(aids + 1 + Ia\6rs) { cjH^Hcs + [Ka^ + ti{H^H)a^,[ \ I 

(67^, Cs) 

{d^ds + l){bs (g) cj) 

[djds -1- 1 -1- (/ - l)a'^Srs[{bs g) iTc'/) 

(^r 5 *^s) 

{djdg + l){b^bs + 1)/ 

[aids + 1 + {I - l)a\6rs][blbs + 1 + {J - l)a%6rs]I 


TABLE I 

Fisher information and average Fisher information sub matrices for pairs of components. 


where e indicates a vector of the residuals, e.g. e^c = Vac ~ 
® which corresponds here to a Gaussian noise 
vector. Using the fact that the noise is white and that {x 0 
y 0 z)~^{x' ^y' IS) z!) = {x^x'){y~^y'){z^z'), we can easily 
evaluate the submatrices in F. They are given in Table I, where 
5rs denotes the discrete delta function. Similar expressions are 
obtained for F'. 

Using the coupling and priors models (33) and (34) and 
the independence hypothesis on the uncoupled factors (H2), 
we can obtain the average Fisher information matrix E(i^), 
which is also given in Table I. 

The matrix P is the information matrix related to the prior 
distribution p{9), it is given by 


P = E 


Vg logp(0)vTlogp(0) 


(41) 


using again the prior models and (H2), we have 


P = 


"A 

0 

0 

0 

0 

0 


0 

/ 

0 

0 

0 

0 


' ^2 


0 

0 

0 

I 

0 

0 


0 

0 

0 

0 

I 

0 


0 

0 

I<g)H 

<yl 

0 

0 

I<S(H^H) . I 

—^^ 
c 


(42) 

The off-diagonal sub matrices are the effects of coupling 
between the two CP models on the information matrix to be 
inverted, the larger is cr^, the smaller is the influence of these 
sub matrices. 


C. Hybrid Cramer-Rao bound 

If we consider the hybrid model where only C is random, 
a bound on the estimation MSB can be obtained through the 
hybrid Cramer-Rao bound (HCRB) ( [40, p. 12]). The HCRB 
is given by 

HCRB = ^. (43) 

- C\C' C\C' ~ 

Matrices F and P^<^ play similar roles as F and P for 
the BCRB. The matrix F is evaluated in a similar way as 
F with the difference that the expectation is taken w.r.t. the 


conditional distribution of the coupled factor p{C\C') only, 
that is 


pC\c' 


Ec|c'[-F] 0 

0 F' 


(44) 


The submatrix F' is directly the Fisher information for the 
second CP model and the elements in Ec|f7/[P] are similar 
to the elements in F, only the components are affected 
by the expectation. For the hybrid model E[cr] = c(, and 
E[cjc«] = + KalSrs- 

The prior matrix is evaluated w.r.t. the distribution of 

~ r 

the random parameters 0 conditioned on the non random 

~Tir 

parameters 6 


pC|<^' = E 


e’’|e" 


v~,\ogp{e \~e )vTiogp(0 \~e ) 


For the hybrid coupled CP model we have 


(45) 


P = 


0 0 
0 0 
0 0 
0 0 
0 0 

0 0 


0 0 0 

0 0 0 

4 0 0 

cr“ 

0 0 0 

0 0 0 

_mif_ 0 0 


0 

0 

I®H 

0 

0 


(46) 


We can still see the off diagonal sub matrices indicating the 
coupling, but parts of the diagonal elements are now equal to 
zero, which means that the hybrid model is less regularized 
than the Bayesian model. 


VI. Simulations 

In this section we simulate the algorithms for estimating the 
latent factors of the coupled CP models in different settings®. 
We focus mainly on the hybrid case where one coupled factor 
is random and all other factors are unknown but considered to 
be deterministic. First we apply the AFS algorithm (Algorithm 
1) to two cases of the joint Gaussian model; direct coupling 
of the C factors and coupling of one component. In the direct 
coupling case, we compare the results with the HCRB. Then 
we show some simulation results on the compressed coupled 
setting. We simulate also the performance of Algorithm 2 on 
the estimation of a Gamma-coupled positive Gamma model. 
At the end of the simulation section, we return to the joint 

^MATLAB® codes related to the simulations reported in this section can 
be found at: www.gipsa-lab.fr/ pierre.comon/TensorPackage. 
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Fig. 1. Total MSE for the factors C and C' of the coupled CP model as a 
function of the coupling intensity ^. Results are shown for ALS algorithms 
considering flexible couplings (Algorithm 1) and hard couplings {C = C). 
The CP models are measured with different noise levels =0.1 and = 
0 . 001 . 
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Gaussian model and simulate the ALS algorithm in the case 
when the C factors are coupled through interpolation. Note 
that the coupling standard deviation CTc in (8) is not considered 
to be a nuisance parameter in these simulations since the norms 
of the columns of the coupled factors are known through 
normalization of the columns of other factors to one. 

All simulation results are given in terms of a Monte-Carlo 
estimation of the total MSE of one or a few factors. The total 
MSE on a factor, for example C, based on different noise 
rplizations is {l/Nr) Y.k=i J 2 n=ii^kr-CJ^r)'^, where 

is the factor estimated in the n — th noise realization. The 
expressions used to evaluate the different signal to noise ratios 
(SNR) in the simulations are given in the Appendix C. 


Fig. 2. MSE for the elements of C and C' of the coupled CP models. The 
models have one similar component (a shared component) on their C factors 
while the other component is not similar. 

and exactly coupled models. The HCRB predicts well this 
transition when the factors are strongly coupled, however a gap 
is present when the models are uncoupled. Moreover in the 
interval I/ctc € [10 ; 100], a flexible coupling works better than 
a hard coupling. Eor l/uc < 10, the performance is similar 
to the uncoupled case, while applying a hard coupling in this 
case will lead to very poor results. Eor 1 /ctc > 100, flexible 
couplings and hard couplings present equivalent performance. 
Note that in Eig. 1, the HCRB for the clean factor practically 
coincides with the “CRB uncoupled and is hence not 
displayed. 


A. Similar factors 

We start with a straightforward hybrid Gaussian coupling 
model C = + E. The two CP models are generated 

randomly (i.i.d. Gaussian N {Q, 1)) with dimensions 1 = 1 ' = 
J = J' = K = K' = 10 and R = R' = 3. The first rows 
in the factors A, B, A' and B are set to 1 , so that we can 
compare the results with the corresponding HCRB. Here the 
columns of factor matrices are thus not normalized. We vary 
the coupling intensity — from 2 to 2 x 10®. The data array y 
is almost noiseless SNR(3;) G [64.77; 65.74] dB (ct„ = 0.001 
(53)), while y has some noise SNR(j;') = 24.77 dB (a'„ = 
0.1 (54)). Algorithm 1 is applied to estimate the CP models 
under 500 different noise and coupling realizations. The total 
MSE on the C and C' factors is evaluated. We also evaluate 
the total MSE for an ALS algorithm using hard couplings, i.e. 
C = C' and the uncoupled CRB for both CP models. Since 
y is generated with different C due to the random coupling, 
we average the resulting CRB. The results are shown in Eig.l. 

We can see that when the coupling is weak the MSE on 
C is close to its uncoupled CRB. An intuitive explanation for 
the behaviour presented in the figure is that, by increasing the 
coupling intensity, the factor is estimated with a much better 
performance, since more information comes from the clean 
tensor through the coupling. The flexible coupling allows to 
assess the continuous transition between uncoupled models 


B. Shared component 

We also simulate the case when components Cy and d-^ are 
shared between the models and the noise levels are similar. 
In this case I = I' = J = J' = K = K' = 10, R = 
R' = 2 and CTc = 0.001. Eactors A, B, A', B' and C' are 
drawn coefficient-wise from standard Gaussian distributions, 
then normalized column-wise with the ^2 norm except for C 
and C', and SNR(j;) « SNR{y) = 9 dB (cr„ = 0.05 and 
a'„ = 0.05 (55)). The MSE on all elements of both C and 
C' are evaluated in a similar way as above. The MSE for the 
estimation of each element of the factors is shown in Eig. 2. 

Although the unshared component is not improved by the 
coupling, we observe in Eig. 2 that the MSE is halved 
when double the amount of data is used (for the shared 
components). We might expect this behaviour when we are 
solving approximately a linear system. 

C. Compressed coupled CP 

Eor large coupled data sets, we simulate the joint compres¬ 
sion strategy on 100 x 100 x 100 tensors^, R = 5 and set 
multilinear rank guesses to Ri = R 2 = R 3 = 5 for both 
coupled tensors. Eactors are drawn randomly similarly to the 

^Larger sizes could have been used but they would require very long Monte- 
Carlo simulations, since we compare the performance with the uncompressed 
strategy. 
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Fig. 3. Reconstruction MSE of noisy factor C with compressed coupled al¬ 
gorithms. Comparison is made with uncoupled and uncompressed algorithms. 

previous example. C is defined from unnormalized C' with 
additive zero mean Gaussian noise of variance CTc = 10“^. 
The SNR for y' is set to 22 dB = 0.0018 (56)), and 
it varies from 4dB to 20 dB for y (a„ € [0.0141; 0.0022] 
(55)). We compare the performance of the coupled estimation 
in the compressed space with the coupled approach in the 
uncompressed space and with three uncoupled approaches: 
standard ALS in the uncompressed space and ALS in the 
compressed space with independent and joint compression. 
Results for N = 50 noise realizations and 50 iterations of 
the coupled algorithms are shown in Fig. 3. Compression was 
computed with three randomized SVD from [37]. Moreover, 
initializations were given by two uncoupled uncompressed 
ALS with 1000 iterations, themselves randomly initialized. 

As expected, compression decreases estimation performance 
in both coupled and uncoupled cases, even more as the noise 
level increases. But it is worth noting that even coupling 
only in the compression improves performance w.r.t. simply 
running an independent ALS on Also, the joint-compressed 
coupled case behaves well. On the other hand, compressing 
dramatically improves the runtime of joint decompositions 
which can be quite slow. Table II shows a clear difference in 
computation time*. Also, compressing allows to go to much 
higher dimensions than the ones used in these simulations. 


Joint-Compressed No Compression 



Standard deviation (s) < 0.01 0.08 


TABLE II 

Mean runtime for 50 iterations of the joint-compressed 
COUPLED decomposition ALGORITHM AND THE COUPLED ALGORITHM 
WITHOUT COMPRESSION. MEAN IS TAKEN OVER ALL REALIZATIONS AND 
ALL NOISE LEVELS. 


D. Gamma coupling of positive variables 

In the case of positive variables, we simulate the coupling 
of positive CP models with all dimensions equal to 10 and 
R = 3. Both the measurement and coupling models follow a 
Gamma distribution, that is /3 = /?' = /?c = 2 in the Tweedie’s 

^Computational time is shown for a MATLAB® implementation of the 
algorithms. The algorithms are tested on a 3.2 GHz Intel® Core™ i5 with 
8Gb of RAM. 


distribution, with dispersion parameters f = 0.5, f = 0.05 
and fc = 0.05^. We generate the CP models randomly. 
The elements in the factors are the absolute values of i.i.d. 
Gaussian variables ff{0, 1) and the columns of A, B, A' and 
B' are normalized with their norms. Both uncoupled and 
coupled (Algorithm 2) algorithms are applied to 50 different 
realizations of the noise distribution, in this case multiplicative 
noise. We impose the constraint that 2 components in A are 
nearly collinear (correlation coefficient of 0.99997). The total 
MSE for the factors are shown in Table III. 



A 

B 

C 

A' 

B' 

C 

Uncoupled 

0.041 

0.054 

4.904 

0.001 

0.001 

0.129 

Coupled 

0.015 

0.021 

0.803 

0.001 

0.001 

0.127 


TABLE III 

Total MSE for the estimation of the factors in the coupled CP 
MODELS with GAMMA LIKELIHOODS AND COUPLINGS. TWO 
COMPONENTS IN A ARE NEARLY COLLINEAR. THE RESULTS ARE SHOWN 
FOR BOTH UNCOUPLED AND COUPLED (ALGORITHM 2) MU 
ALGORITHMS. 

The results show a large gain in performance in all factors 
in the noisy tensor. The fact that A has nearly collinear 
columns influences strongly the conditioning of the uncoupled 
problem. In the Gamma coupled case, information flows from 
the clean tensor to the noisy tensor in an heterogeneous way. 
The elements in C' are linked to the variations in C', since 
the model is multiplicative. This heterogeneous information 
flow allows to better estimate C and, since the model is ill 
conditioned, a reduction on the estimation error in C will 
greatly affect the estimation performance of A and B. 

E. Different sampling rates 

As another non trivial example of coupling we consider the 
case when I = I' = J = J' = 10, but with different sizes 
on the third mode K — 37 and K' = 53. We suppose that 
the components on C and C' are uniformly sampled versions 
of the same underlying continuous functions, however, the 
sampling periods to obtain the factors are different so that the 
elements in the factors cannot be similar, at least, in most of 
the points. Since the factors are not similar, we cannot apply 
the direct coupling model and we must use an interpolation 
approach as explained at the end of Sec. III. 

To obtain the interpolation kernel we assume that the 
sampled functions are band-limited and periodic. For an odd 
number of samples, the interpolation kernel is given by the 
Dirichlet kernel [41] 

Ksm{n[ti-R\/[{L-1)T^} ^ ’ 

where ti = {I — 1)T^ with I = 1, ..., L is the uniform 
interpolation grid with interpolation period Tf tk = {k— 1)T 
with fc = 1, ..., K and = {k' — 1)T' with k' = 1, K' 
are the original sampling grids with sampling periods T and 

r. 

^Note that here defining a SNR is difficult since the noise power depends 
on the signal power. 
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Sampling grid for C 

0 1 2 3 4 



Fig. 4. Underlying continuous function ci(f) for the first components of the 
C factors and their corresponding sampled versions ci and c'^ obtained with 
different sampling grids. 


We simulate two random CP models, R = R' = 3 and 
normalize factors A, B, A' and B'. The components on 
the C factors are generated by regularly sampling Cr{t) = 
^ir sin(27r/if) in the time interval t € [0; 4] where 
air are drawn from i.i.d. Gaussian variables Af{Q, 1). We 
assume Nc = 3 and fi = 2.05, /2 = 2.55, /a = 3.5. 
Observe that the first two components do not have an integer 
number of periods in the considered time duration, therefore 
these components can be only approximated and not exactly 
reconstructed with the assumed kernel. The sampling periods 
T = 4/37, T' = 4/53 are chosen to have K and K' samples 
in the given time duration. An example of continuous-time 
component with its sampled points on different grids is shown 
in Fig. 4. 

We fix SNR(V') = 26.46 dB (cr; = 0.01 (57)), while 
SNR(V) varies from 6.46 dB to 46.46 dB (ct„ € [0.001; 0.1] 
(57)) so that SNR(V') - SNR(V) G [-20; 20]. For the 
considered model, interpolation can only approximate the 
continuous-time signal. Therefore, it is necessary to set a 
nonzero CTc even if the continuous signals are the same for both 
data sets. We set L = 100 (T^ = 4/100), CTc = 0.15 and we 
evaluate the total MSB on the coupled factors with 200 noise 
realizations in the same way as before. We also evaluate the 
total MSB for approximately hard couplings (CTc = 10“^). The 
results are shown in Big. 5. When the noise ratio increases, the 
total MSB for the uncoupled approach increases sharply, while 
the coupled approach has a smooth increase. This shows that 
even though the coupled factors are not similar, information 
can still be exchanged between them through interpolation. 
We can also see that an approximately hard coupling approach 
does not lead to good results in this case. 

We consider also the case when both data arrays are noisy 
SNR(V') « SNR(V) = 6.46 dB (cr„ = < = 0.1 (57)) 
and the number of interpolation points for the coupling L 
varies from 5 to 75. We use the previously considered coupling 
parameters CTc for the flexible and approximately hard coupling 
approaches. Total MSB values obtained with 200 different 
realizations of noise are shown in Big. 6. 

By increasing the number of interpolation points the infor¬ 
mation exchanged within the model is larger and the MSB 



Fig. 5. Total MSE for the estimation of the C factors for different 
SNR(y') — SNR(y). The CP models are coupled through interpolation. 
Estimation is carried out considering three different models: uncoupled, 
approximately hard coupled (tJc = and flexibly coupled (ctc = 0.15). 



10 20 30 40 50 60 70 

Number of coupling samples - L 


Fig. 6. Total MSE for the estimation of the C factors as a function of the 
number of interpolation samples. Estimation is carried out considering three 
different models: uncoupled, approximately hard coupled model (ctc = 10“^) 
and flexibly coupled (ctc = 0.15). The lines above and below each plot 
indicate 95% confidence intervals for the estimation of the total MSE. 


decreases almost linearly for the flexible coupling approach. 
Above L = 53, only a small quantity of information can be 
exchanged because this is the maximum resolution present in 
the data and the total MSB curve becomes almost flat. The 
approximately hard coupling approach gives its best result 
at coarsest resolution L = 37, which is better than flexible 
coupling for this particular value of L but worse than the best 
results for the flexible coupling approach (L > 53). Observe, 
as indicated by the confidence intervals in the figure, that the 
total MSB oscillations seem to be linked to a deterministic 
pattern and not to the specific set of noise realizations. 

As a last simulation we consider the case when one tensor 
is clean SNR(V) = 46.67 dB (cr„ = 0.001 (57)) and is of size 
10 X 10 X 24 with a sampling period T = 4/24 in the third 
mode. The other tensor is very noisy SNR([y) = —5.61 dB 
((j„ = 0.001 (57)), but it has a higher resolution in the third 
mode, its size is 10 x 10 x 37 and sampling period T' = 4/37. 
The frequencies in the components are fi = 3.22, /2 = 3.47, 
/s = 3.73, as a consequence, the first tensor cannot be used 
to recover the continuous components in the third mode since 
its sampling frequency is below 2 /i. 

We cannot couple the CP models using interpolation on 
both factors, since the continuous-time components cannot be 
reconstructed from the factor C, however we can interpolate 
the high resolution C' to have the same low resolution 
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grid as C and then couple them using a flexible coupling 
(CTc = 0.15). We simulated this approach using uncoupled and 
coupled algorithms. The algorithms are applied to 200 noise 
realizations. After retrieving factors C and C', we interpolate 
both factors C and C' using the sampling kernel (47) with 
a very fine grid (5000 points) to simulate the continuous 
function and we integrate numerically the squared error (using 
trapezoidal integration), so that we have a total squared error 
which approximates the integral of the continuous squared 
error. We then average the total squared errors. The result is 
given in Table VI-E. 


Algorithm 

Factor 

Total squared error 

Uncoupled 

C 

33.491 

Coupled 

C 

33.494 

Uncoupled 

c 

3.959 

Coupled 

c 

1.061 


TABLE IV 

Total MSE for the reconstruction of the continuous 

BAND-LIMITED COMPONENTS FROM FACTORS C AND C' USING 
UNCOUPLED AND COUPLED MODELS. 

Although the factor C does not have enough information 
to recover within some acceptable accuracy the underlying 
continuous function, with or without the coupling, it still 
contains a lot of information which can be transferred through 
the interpolation coupling. The information contained in C 
allows to better estimate C' and, as a consequence, to better 
reconstruct the continuous function. 
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Fig. 7. Mean of the ratio between the cost function values for the coupled 
CP ALS using random initialization (cold start) Tc and the results given by 
uncoupled ALS (warm start) 'Tw 

In this work we have focused on the MAP estimation of 
the latent factors, where all parameters of the coupling model 
are considered to be known. Also, to retrieve the factors, 
we proposed modifications of well-known easy to implement 
algorithms. Future directions of work in this subject may 
include MMSE estimation of parameters using Markov chain 
Monte Carlo techniques, full Bayesian approaches to estimate 
the parameters of the coupling model when they are unknown, 
evaluation of the effect of the flexible coupling models on 
tensor completion and comparison of performance of different 
optimization approaches. 

Appendix A 

Warm start and cold start 


VII. Conclusions 

Since the expression of a phenomenon can be different 
in various data sets, the link between factorizations of the 
data sets should be somehow flexible. To give a meaning to 
this flexibility we have proposed in this paper a Bayesian 
setting for the coupling between factors of the tensor CP 
model, which encompasses matrix-matrix and matrix-tensor 
couplings. Under this setting, we can explore multimodal data 
fusion problems by proposing not only trivial flexible links 
between factors, e.g. i.i.d. Gaussian models for the differences 
between factors having common dimensions, but also joint 
Gaussian models in transformed domains, sparse similarity 
models and positive similarity models. 

We proposed algorithms to tackle the underlying optimiza¬ 
tion problems, taking into account scaling ambiguities. The 
latter turn out to play a key role in coupled models, and 
needed a special care. To allow scalability to large datasets, 
a compression scheme for coupled data has been derived; 
this scheme is being extended to deal with more general 
coupling models. Finally, we provided extensive applications 
of the Bayesian framework and evaluated estimation perfor¬ 
mances w.r.t. the computed CRB for both hybrid and joint 
Gaussian models. We found that coupling can help improving 
the conditioning of ill-conditioned problems, and improves 
estimation accuracy overall. In a large number of applications, 
the flexibility offered by our coupling approach has revealed 
to be of crucial importance. 


Here we give some simulations to illustrate the gap in 
performances between coupled algorithms initialized with two 
independent ALS (or any other classical tensor algorithm) 
and two random guesses. The experiments were done for a 
Gaussian coupling model of the type C = C'+T. The ratio of 
the averaged cost functions are shown in Fig. 7 for different Uc- 
The tensor dimensions are I = J = K = I' = J' = K' = 10 
and R = 3. The noise levels are (t„ = 0.03 and (t„ = 0.01 
and the number of realizations is 100. Using warm start seems 
better than cold start. In practice, a much larger number of 
swamps has been observed in cold start than in warm start. 


Appendix B 

Gradient matrices for the MU algorithm 


The negative and positive terms of the gradient matrix of 
cost function (15) w.r.t. factor A are 




(48) 


where = A{C Q B)^ is the first unfolding of the CP 
model X. For the factor B we have 


VTb {V(2) 0 [X(2)] ^}{cqA) 
VT+ =i[A:(2)]'-^(C©A) 


( 49 ) 
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where -X'( 2 ) = B (C Q A)^. For factor C an additional term 
appear due to the coupling 

VT5 .1 {r<3, □ [Xp,]-} (BeA) + 

VTJ -i [X<„](B 0 A) + f (. 0 C) + gig. 

( 50 ) 

The gradients w.r.t. A and B have the same form as (48) 
and (49), while the gradient w.r.t. C' is given by 

+^{c^[cr-} (51) 

VTg [Xg](S' eA') + ^^ [C] . 

Appendix C 

SNR FOR SOME COUPLING SCENARIOS 

In this appendix, we present the expressions for the SNR 
of the tensors y and y' considered in simulations (Sec. VI) 
following the hybrid Gaussian models. We present the SNR for 
the direct coupling model with unnormalized and normalized 
factors (Subsec. VI-A and VI-C) and for different samplings 
of a sum of sine functions (Subsec. VI-E). 

a) Direct coupling: we recall briefly the simulation con¬ 
ditions: factors A, B, A', B' and C' are drawn independently 
according to Al(0,1), the noise tensors V and V' are drawn 
from i.i.d. A/'(0, erg and A/'(0, a'^) respectively. The first rows 
of A, B, A' and B' are set to 1 in the unnormalized case, 
while their columns are divided by their respective £2 norms 
in the normalized cases. The elements Ci j are drawn from 

In the unnormalized case, the SNR for tensor ^ is written 
as follows: 

SNR(y) = 10 logio (e [^1^]) (52) 

= 101ogio(^^) (53) 

while tensor y' has a more usual SNR given by: 

SNR(y) = 101ogio(^^^. (54) 

In the normalized case we have 

SNR{y) = 10 logio (^^^ 77 ^) (55) 

and 

SNR(y) = 101ogio(^-^^). (56) 

Note that for r < R shared components, as the 
model considered in Subsec. VI-B, SNR(3^) changes to 
lOlogio [{ra^ + R)/iIJal)]. 


b) Sum of sines: in Subsec. VI-E factors A, B, A' and 
B' are generated in the same way as in the previous paragraph, 
however the columns and g are generated by sampling with 
sampling periods T and T' the continuous functions Cr{t) = 
ctir sin{2TTfit) with i.i.d. air drawn from Af{0, 1). The 
SNR of the tensor models is 


SNRCF) = lOlogio 


^E£iEf^iSin^(27r/.fcr) 

IJKal 
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